# Get starting time
starting_time <- Sys.time()

################################################################################
# List of packages that we need
packages <- c("ggplot2", "mia", "miaViz", "dplyr", "tidyr", "scater")

# Get packages that are already installed installed
packages_already_installed <- packages[ packages %in% installed.packages() ]

# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )

# Loads BiocManager into the session. Install it if it not already installed.
if( !require("BiocManager") ){
    install.packages("BiocManager")
    library("BiocManager")
}

# If there are packages that need to be installed, installs them with BiocManager
# Updates old packages.
if( length(packages_need_to_install) > 0 ) {
   install(packages_need_to_install, ask = FALSE)
}

# Load all packages into session. Stop if there are packages that were not
# successfully loaded
if( any(!sapply(packages, require, character.only = TRUE)) ){
    stop("Error in loading packages into the session.")
}

################################################################################
# Additional setup

# Set black and white theme for figures, and Arial font
theme <- theme_bw() + theme(text = element_text(family = "Arial"), 
                            panel.border = element_blank(), 
                            panel.grid.major = element_blank(),
                            panel.grid.minor = element_blank(), 
                            axis.line = element_line(colour = "black"))
theme_set(theme)

EuroBioC2023

Presenter information

All authors are affiliated to Turku Data Science Group in University of Turku, Finland.


Learning goals

  1. Microbiome research studies interactions between microbes (and human, environment…)
  2. Big data requires efficient tools to manipulate the data
  3. miaverse is a SummarizedExperiment framework for microbiome analytics
Figure source: Moreno-Indias et al. (2021) Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Frontiers in Microbiology 12:11.
Figure source: Moreno-Indias et al. (2021) Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Frontiers in Microbiology 12:11.

Motivation

Microbiome research

  • Microbiome is a composition of microbes in well-defined area (gut, skin, mouth…)
  • Bilateral interaction between human and microbiome –> affects both health and disease.
  • The research is based on sequencing (characterization of genes and species).
  • Nowadays, multiomics approach is more common (integration of taxonomy information with metabolite data, for example)
  • Computational methods are the new microscope
  • The research has expanded rapidly in previous years
# Plot publication graph
path <- "data/PubMed_Timeline_Results_by_Year.csv"
df <- read.csv(path, skip = 1)

x <- "Year"
y <- "Count"

plot <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
    geom_bar(stat="identity")
plot
PubMed publications per year with a search term 'microbiome' (fetched: Sep 5, 2023)

PubMed publications per year with a search term ‘microbiome’ (fetched: Sep 5, 2023)

Big data

  • Most of the cohort studies require multiple types of data, for instance microbiome, phenotype and other omics data.
  • Linking these types of data could present multiple challenges; to mention:
    • Different sources the data is coming form, could use different types of key identifier of the samples or subjects of the study; which could poses a problem in some cases for linking the data together.
    • The choice of the data structure to combine and store these mulitple types of data is critical as well, in terms of the management, handling and wrangling of the data; to be able to perform the analysis.
  • Cohort datasets are large in size, which mostly one cannot conduct an analysis using a normal computer with limited hardware; HPC and cloud computing services are required in this case.
  • Last but not least, a quality check reporting of the constructed object is substantially important to verify the correctness of the building process.

Regarding the choice of the data strcuture to handle cohorts with muliple data types or sources MultiAssayExperiment, is considered to be very liable in such situation, easing-up the management and wrangling of the data; once is construced correctly. Moreover, several R packages frameworks are increasingly integrating MultiAssayExperiment and the SummarizedExperimentclass, to provide users with a reliable and ease of use data structure.

miaverse (MIcrobiome Analysis)

The structure of the TreeSummarizedExperiment (TreeSE) class.
The structure of the TreeSummarizedExperiment (TreeSE) class.

The workflow

Importing the dataset

We get the data from MGnify database. It is a EMBL-EBI’s database for metagenomic data. This large microbiome database can be accessed with MGnifyR package which nowadays support TreeSE. The package will be submitted to Bioconductor’s next release.

We chose this dataset…

As loading takes some time, the dataset is already loaded.

For other available datasets and importing methods, check OMA.

# library(MGnifyR)
# mg <- MgnifyClient()
# 
# analyses <- searchAnalysis(mg, "studies", "MGYS00005128")
# analyses <- searchAnalysis(mg, "studies", "MGYS00000596")
# mae <- getResult(mg, analyses)
library(mia)
data("HintikkaXOData")
MAE <- HintikkaXOData
MAE
## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 3:
##  [1] microbiota: TreeSummarizedExperiment with 12706 rows and 40 columns
##  [2] metabolites: TreeSummarizedExperiment with 38 rows and 40 columns
##  [3] biomarkers: TreeSummarizedExperiment with 39 rows and 40 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

As we can see, our data contains three different types of data: microbiota, metabolites and biomarkers; which are all of the TreeSummarizedExperiment class combined in ourMultiAssayExperiment object.

Microbiota:

MAE[["microbiota"]] # or MAE[[1]]
## class: TreeSummarizedExperiment 
## dim: 12706 40 
## metadata(0):
## assays(1): counts
## rownames(12706): GAYR01026362.62.2014 CVJT01000011.50.2173 ... JRJTB:03787:02429 JRJTB:03787:02478
## rowData names(7): Phylum Class ... Species OTU
## colnames(40): C1 C2 ... C39 C40
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

Metabolites:

MAE[["metabolites"]] # or MAE[[2]]
## class: TreeSummarizedExperiment 
## dim: 38 40 
## metadata(0):
## assays(1): nmr
## rownames(38): Butyrate Acetate ... Malonate 1,3-dihydroxyacetone
## rowData names(0):
## colnames(40): C1 C2 ... C39 C40
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

Biomarkers:

MAE[["biomarkers"]] # or MAE[[3]]
## class: TreeSummarizedExperiment 
## dim: 39 40 
## metadata(0):
## assays(1): signals
## rownames(39): Triglycerides_liver CLSs_epi ... NPY_serum Glycogen_liver
## rowData names(0):
## colnames(40): C1 C2 ... C39 C40
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

Additionally, the MAE contains phenotype or metadata about samples or subjects, that could be accessed as follows:

colData(MAE)
## DataFrame with 40 rows and 6 columns
##          Sample      Rat        Site          Diet      Fat       XOS
##     <character> <factor> <character>      <factor> <factor> <numeric>
## C1           C1        1       Cecum      High-fat     High         0
## C2           C2        2       Cecum      High-fat     High         0
## C3           C3        3       Cecum      High-fat     High         0
## C4           C4        4       Cecum      High-fat     High         0
## C5           C5        5       Cecum      High-fat     High         0
## ...         ...      ...         ...           ...      ...       ...
## C36         C36       36       Cecum Low-fat + XOS      Low         1
## C37         C37       37       Cecum Low-fat + XOS      Low         1
## C38         C38       38       Cecum Low-fat + XOS      Low         1
## C39         C39       39       Cecum Low-fat + XOS      Low         1
## C40         C40       40       Cecum Low-fat + XOS      Low         1

Exploring the microbiome data:

Taxa Counts

library(dplyr)
rowData(MAE[[1]]) %>% as_tibble() %>% summarise_all(n_distinct)
## # A tibble: 1 × 7
##   Phylum Class Order Family Genus Species   OTU
##    <int> <int> <int>  <int> <int>   <int> <int>
## 1     14    23    43     74   225     319 12706

Top 10 Species

library(miaViz)
# we agglomerate the data from OTUs to species,
# and store it at the alternative experiment slot 
altExp(MAE[[1]], "Species") <- agglomerateByRank(MAE[[1]], rank="Species",
                                                 onRankOnly=TRUE, na.rm = FALSE)
# we transform data from counts to relative abundance
altExp(MAE[[1]], "Species") <- transformAssay(altExp(MAE[[1]], "Species"),
                                          assay.type = "counts",
                                          method = "relabundance")
plotAbundanceDensity(altExp(MAE[[1]], "Species"), assay.type = "relabundance",
                     n=10L, point_size=0.5, layout="density") +
    scale_x_log10() +
    labs(title = "Top 10 Species", x="log10(relabundance) ")+
    theme(plot.title = element_text(hjust = 0.5, size = 12),
          text=element_text(size = 8))
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 15 rows containing non-finite values (`stat_density()`).

Top 10 Genera

# we agglomerate the data from OTUs to species,
# and store it at the alternative experiment slot 
altExp(MAE[[1]], "Genus") <- agglomerateByRank(MAE[[1]], rank="Genus",
                                                 onRankOnly=TRUE, na.rm = FALSE)
# we transform data from counts to relative abundance
altExp(MAE[[1]], "Genus") <- transformAssay(altExp(MAE[[1]], "Genus"),
                                          assay.type = "counts",
                                          method = "relabundance")
plotAbundanceDensity(altExp(MAE[[1]], "Genus"), assay.type = "relabundance",
                     n=10L, point_size=0.5, layout="density") +
    scale_x_log10() +
    labs(title = "Top 10 Genus", x="log10(relabundance) ")+
    theme(plot.title = element_text(hjust = 0.5, size = 12),
          text=element_text(size = 8))

Alpha diveristy

Lets look at ShannonIindex and Observed Richness, where the results are stored back automatically at the colData of the MAE microbiome Species alternative experiment:

altExp(MAE[[1]], "Species") <- estimateDiversity(altExp(MAE[[1]], "Species"),
                                                 index="shannon", 
                                                 assay.type="counts")
altExp(MAE[[1]], "Species") <- estimateRichness(altExp(MAE[[1]], "Species"),
                                                index="observed",
                                                assay.type="counts")
colData(altExp(MAE[[1]], "Species"))
## DataFrame with 40 rows and 2 columns
##       shannon  observed
##     <numeric> <numeric>
## C1    1.68565        79
## C2    1.36891        55
## C3    1.10182        69
## C4    1.18286        69
## C5    1.18953        61
## ...       ...       ...
## C36  1.288436        65
## C37  1.110693        66
## C38  0.937789        60
## C39  1.061396        53
## C40  1.787509        88

Let’s look at the histogram of both indices:

colData(altExp(MAE[[1]], "Species")) %>% as_tibble() %>%
    tidyr::pivot_longer(cols=c(shannon,observed), names_to = "alpha_type",
                        values_to = "estimate") %>% 
    ggplot(aes(x=estimate)) +
    geom_histogram(color = "black", fill = "gray", bins = 30)+
    facet_wrap(~alpha_type, nrow = 1, scales = "free")+
    labs(title = "Alpha Diversity")+
    theme(plot.title = element_text(hjust = 0.5))

Beta diveristy

Let’s look at beta diversity by performing Principal Coordinate Analysis (PCoA) using Bray-Curtis as a distance method on the OTU level data:

library(scater)
MAE[[1]] <- runMDS(MAE[[1]], FUN = vegan::vegdist, name = "PCoA_BC",
              assay.type = "counts")
# Calculating explained variance
e <- attr(reducedDim(MAE[[1]], "PCoA_BC"), "eig");
rel_eig <- e/sum(e[e>0])
# Ploting the ordination 
plotReducedDim(MAE[[1]], "PCoA_BC") + 
    labs(x = paste("PC1 (", round(100 * rel_eig[[1]],1), "%", ")", sep = ""),
         y = paste("PC2 (", round(100 * rel_eig[[2]],1), "%", ")", sep = ""),
         title="PCoA with Bray-Curtis")+
    theme(plot.title = element_text(hjust = 0.5))

With could visualize the same plot using a metadata varible (from colData), to overlay it on our PCoA plot:

# we first assign our colData from MAE to MAE [[1]]
colData(MAE[[1]]) <- colData(MAE)[colnames(MAE[[1]]),]
plotReducedDim(MAE[[1]], "PCoA_BC", colour_by="Fat") + 
    labs(x = paste("PC1 (", round(100 * rel_eig[[1]],1), "%", ")", sep = ""),
         y = paste("PC2 (", round(100 * rel_eig[[2]],1), "%", ")", sep = ""),
         title="PCoA with Bray-Curtis")+
    theme(plot.title = element_text(hjust = 0.5))

Thanks!

  • more material in OMA
  • miaverse logo
  • project website and QR also.
  • Contact info
  • Poster info

Session info

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fi_FI.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Helsinki
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scater_1.29.3                  scuttle_1.11.2                 tidyr_1.3.0                    dplyr_1.1.2                    miaViz_1.9.0                  
##  [6] ggraph_2.1.0                   ggplot2_3.4.2                  BiocManager_1.30.21.1          mia_1.9.9                      MultiAssayExperiment_1.27.0   
## [11] TreeSummarizedExperiment_2.9.0 Biostrings_2.69.2              XVector_0.41.1                 SingleCellExperiment_1.23.0    SummarizedExperiment_1.31.1   
## [16] Biobase_2.61.0                 GenomicRanges_1.53.1           GenomeInfoDb_1.37.2            IRanges_2.35.2                 S4Vectors_0.39.1              
## [21] BiocGenerics_0.47.0            MatrixGenerics_1.13.1          matrixStats_1.0.0             
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.15.0           jsonlite_1.8.7              magrittr_2.0.3              ggbeeswarm_0.7.2            farver_2.1.1                rmarkdown_2.23             
##   [7] zlibbioc_1.47.0             vctrs_0.6.3                 memoise_2.0.1               DelayedMatrixStats_1.23.0   RCurl_1.98-1.12             ggtree_3.9.0               
##  [13] rstatix_0.7.2               BiocBaseUtils_1.3.2         htmltools_0.5.5             S4Arrays_1.1.5              BiocNeighbors_1.19.0        broom_1.0.5                
##  [19] gridGraphics_0.5-1          SparseArray_1.1.11          sass_0.4.7                  bslib_0.5.0                 htmlwidgets_1.6.2           plyr_1.8.8                 
##  [25] DECIPHER_2.29.0             cachem_1.0.8                igraph_1.5.0.1              lifecycle_1.0.3             pkgconfig_2.0.3             rsvd_1.0.5                 
##  [31] Matrix_1.6-0                R6_2.5.1                    fastmap_1.1.1               GenomeInfoDbData_1.2.10     aplot_0.1.10                digest_0.6.33              
##  [37] ggnewscale_0.4.9            colorspace_2.1-0            patchwork_1.1.2             irlba_2.3.5.1               RSQLite_2.3.1               ggpubr_0.6.0               
##  [43] vegan_2.6-4                 beachmat_2.17.15            labeling_0.4.2              fansi_1.0.4                 polyclip_1.10-4             abind_1.4-5                
##  [49] mgcv_1.9-0                  compiler_4.3.0              bit64_4.0.5                 withr_2.5.0                 backports_1.4.1             BiocParallel_1.35.3        
##  [55] carData_3.0-5               viridis_0.6.4               DBI_1.1.3                   highr_0.10                  ggforce_0.4.1               ggsignif_0.6.4             
##  [61] MASS_7.3-60                 DelayedArray_0.27.10        bluster_1.11.4              permute_0.9-7               tools_4.3.0                 vipor_0.4.5                
##  [67] beeswarm_0.4.0              ape_5.7-1                   glue_1.6.2                  nlme_3.1-162                grid_4.3.0                  cluster_2.1.4              
##  [73] reshape2_1.4.4              generics_0.1.3              gtable_0.3.3                BiocSingular_1.17.1         tidygraph_1.2.3             ScaledMatrix_1.9.1         
##  [79] car_3.1-2                   utf8_1.2.3                  ggrepel_0.9.3               pillar_1.9.0                stringr_1.5.0               yulab.utils_0.0.6          
##  [85] splines_4.3.0               tweenr_2.0.2                treeio_1.25.2               lattice_0.21-8              bit_4.0.5                   tidyselect_1.2.0           
##  [91] DirichletMultinomial_1.43.0 knitr_1.43                  gridExtra_2.3               xfun_0.39                   graphlayouts_1.0.0          DT_0.28                    
##  [97] stringi_1.7.12              ggfun_0.1.1                 lazyeval_0.2.2              yaml_2.3.7                  evaluate_0.21               codetools_0.2-19           
## [103] tibble_3.2.1                ggplotify_0.1.1             cli_3.6.1                   jquerylib_0.1.4             munsell_0.5.0               Rcpp_1.0.11                
## [109] parallel_4.3.0              blob_1.2.4                  sparseMatrixStats_1.13.0    bitops_1.0-7                decontam_1.21.0             viridisLite_0.4.2          
## [115] tidytree_0.4.4              scales_1.2.1                purrr_1.0.1                 crayon_1.5.2                rlang_1.1.1                 cowplot_1.1.1